Approximating the entropy of large alphabets
Abstract
We consider the problem of approximating the entropy of a discrete distribution P on a domain of size q, given access to n independent samples from the distribution. It is known that n ≥ q is necessary, in general, for a good additive estimate of the entropy. The problem of multiplicative entropy estimation was recently addressed by Batu, Dasgupta, Kumar, and Rubinfeld. They show that n = q suffices for a factor-α approximation, α < 1. We introduce a new parameter of a distribution: its effective alphabet size q_ef(P). This is a more intrinsic property of the distribution, depending only on its entropy moments. We show q_ef ≤ Õ(q). When the distribution P is essentially concentrated on a small part of the domain, q_ef ≪ q. We strengthen the result of Batu et al. by showing that it holds with q_ef replacing q. This has several implications. In particular, the rate of convergence of the maximum-likelihood entropy estimator (the empirical entropy), for both finite and infinite alphabets, is shown to be dictated by the effective alphabet size of the distribution. Several new, and some known, facts about this estimator follow easily. Our main result is algorithmic. Though the effective alphabet size is, in general, an unknown parameter of the distribution, we give an efficient procedure (with access to the alphabet size only) that achieves a factor-α approximation of the entropy with n = Õ(exp{α · log q · log q_ef}). Assuming (for instance) log q_ef ≪ log q, this is smaller than any power of q. Taking α → 1 leads in this case to efficient additive estimates for the entropy as well. Several extensions of the results above are discussed.
School of Computer Science and Engineering, Hebrew University, Jerusalem, Israel. Electronic Colloquium on Computational Complexity, Report No. 84 (2005)
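For readers unfamiliar with the maximum-likelihood (plug-in) estimator mentioned in the abstract, the following is a minimal sketch of it, not the paper's algorithm. The function names `empirical_entropy` and `true_entropy`, the choice of bits as the unit, and the toy distribution in the demo are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def empirical_entropy(samples):
    """Plug-in (maximum-likelihood) entropy estimate, in bits.

    Replaces each unknown probability by its observed frequency c / n
    and evaluates the Shannon entropy of that empirical distribution.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def true_entropy(probs):
    """Shannon entropy, in bits, of a known probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


if __name__ == "__main__":
    # Hypothetical toy distribution: alphabet of size q, with 99% of the
    # mass concentrated on 4 symbols and the rest spread uniformly.
    rng = random.Random(0)
    q = 1024
    heavy = 4
    probs = [0.99 / heavy] * heavy + [0.01 / (q - heavy)] * (q - heavy)
    for n in (50, 500, 5000):
        samples = rng.choices(range(q), weights=probs, k=n)
        print(n, round(empirical_entropy(samples), 3), round(true_entropy(probs), 3))
```

Running the demo prints the estimate next to the true entropy for a few sample sizes, which gives a rough feel for how the plug-in estimate behaves when most of the mass sits on a small part of a large domain.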
Similar resources
Large Alphabets and Incompressibility
We briefly survey some concepts related to empirical entropy (normal numbers, de Bruijn sequences and Markov processes) and investigate how well it approximates Kolmogorov complexity. Our results suggest that lth-order empirical entropy stops being a reasonable complexity metric for almost all strings of length m over alphabets of size n at about the point when n^l surpasses m.
Shannon Entropy Estimation in $\infty$-Alphabets from Convergence Results
The problem of Shannon entropy estimation in countably infinite alphabets is revisited using convergence results for the entropy functional. Sufficient conditions for the convergence of the entropy are used, including scenarios with both finitely and infinitely supported distributions. From this angle, four plug-in histogram-based estimators are studied showing strong consistency ...
Approximating kCSP For Large Alphabets
In this work we show that the constraint satisfaction problem (CSP), where constraints depend on k variables each and variables range over an alphabet of size d, can be efficiently approximated to within Ω(kd/d) for any k,d. Previous work by Makarychev and Makarychev obtained an approximation ratio of Ω(kd/d) only for the case where the alphabet size d is at most exponentially large in k. In contr...
Alphabet Partitioning Techniques for Semi-Adaptive Huffman Coding of Large Alphabets
Practical applications that employ entropy coding for large alphabets often partition the alphabet set into two or more layers and encode each symbol by using some suitable prefix coding for each layer. In this paper, we formulate the problem of finding an alphabet partitioning for the design of a two-layer semi-adaptive code as an optimization problem, and give a solution based on dynamic prog...
Optimal Alphabet Partitioning for Semi-Adaptive Coding of Sources of Unknown Sparse Distributions
Practical applications that employ entropy coding for large alphabets often partition the alphabet set into two or more layers and encode each symbol by using some suitable prefix coding for each layer. In this paper we formulate the problem of optimal alphabet partitioning for the design of a two layer semiadaptive code and give a solution based on dynamic programming. However, the complexity ...
Journal title:
- Electronic Colloquium on Computational Complexity (ECCC)
Publication date: 2005